Experiments on determinism

  • Author: Fang Zhang
  • Date: 2016.6.15
  • E-mail: fza34@sfu.ca

Experiment 20160525

  1. Use fastq-tools to shuffle the sample fastq file:
    fastq-sort -R --seed=3 input.fastq > output.fastq
    
  2. Run the pipeline on the shuffled fastq files (very slow). The read order in the two paired fastq files should be checked.
  3. Download the Beijing lineages.

Experiment 20160526

  1. Run the pipeline on the shuffled samples and compare the results with the unshuffled ones.
    nohup ./polyTB_Breezy_try.sh /home/zhf615/TB_test/MALAWI/sample /home/zhf615/TB_test/MALAWI/reference/AL123456.2 TBSequence ERR245648 /global/scratch/zhf615/MALAWI/data/fastq/ /home/zhf615/TB_test/MALAWI/input/RESI_SNPs/Resi-List-MasterV27_.vcf.gz &
    
  2. Test Beijing lineage sequences.
  3. Check the order of shuffled samples.
    cat  ERR245648_1_random.fastq | grep @ERR245648 | awk '{print $1}' | awk  -F '.' '{print $2}'  > ERR245648_1.txt
    cat  ERR245648_2_random.fastq | grep @ERR245648 | awk '{print $1}' | awk  -F '.' '{print $2}'  > ERR245648_2.txt
    
    The read orders in the two files are not the same, so another shuffle method has to be used.
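
    Shuffling each file on its own is what breaks the pairing, so one option is to shuffle both files with a single permutation and then verify the order. The following is only a sketch and is not the shuffle code from the paper used in the next experiment. The function names read_ids, same_read_order and shuffle_pairs are made up for illustration, and it assumes 4-line fastq records, no tab characters inside the files, and read names that end in an optional /1 or /2. The shuffle is reproducible for a given seed only with the same awk implementation.

```shell
# read_ids FASTQ: print the read name of every record (first line of each
# 4-line record), without the leading '@' and without a trailing /1 or /2.
read_ids() {
    awk 'NR % 4 == 1 { id = $1; sub(/^@/, "", id); sub(/\/[12]$/, "", id); print id }' "$1"
}

# same_read_order R1 R2: succeed only if both files list the same reads
# in the same order.
same_read_order() {
    _ids1=$(mktemp) || return 2
    _ids2=$(mktemp) || return 2
    read_ids "$1" > "$_ids1"
    read_ids "$2" > "$_ids2"
    cmp -s "$_ids1" "$_ids2"
    _status=$?
    rm -f "$_ids1" "$_ids2"
    return "$_status"
}

# shuffle_pairs R1 R2 OUT1 OUT2 [SEED]: shuffle both files with one permutation.
shuffle_pairs() {
    _seed=${5:-3}
    _tab=$(printf '\t')
    _dir=$(mktemp -d) || return 2
    # Flatten each 4-line record to one tab-separated line.
    paste - - - - < "$1" > "$_dir/r1.tsv"
    paste - - - - < "$2" > "$_dir/r2.tsv"
    # Put mates side by side, prefix a seeded random key, sort by that key,
    # drop the key, then write fields 1-4 and 5-8 back as fastq records.
    paste "$_dir/r1.tsv" "$_dir/r2.tsv" |
        awk -v seed="$_seed" 'BEGIN { srand(seed) } { printf "%.10f\t%s\n", rand(), $0 }' |
        LC_ALL=C sort -t "$_tab" -k1,1 |
        cut -f2- |
        awk -F '\t' -v o1="$3" -v o2="$4" '{
            print $1 "\n" $2 "\n" $3 "\n" $4 > o1
            print $5 "\n" $6 "\n" $7 "\n" $8 > o2
        }'
    rm -rf "$_dir"
}
```

    Possible usage, with the input file names assumed: shuffle_pairs ERR245648_1.fastq ERR245648_2.fastq ERR245648_1_random.fastq ERR245648_2_random.fastq 3, followed by same_read_order on the two output files.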

Experiment 20160530

  1. Use the shuffle code given by the paper.
  2. Terms:
  • GATK = SNPs found with GATK
  • MPILEUP = SNPs found with mpileup
  • INTERSECTION = intersection of both sets of SNPs
  • dedup = duplicate mappings are deleted
  • shuffled = samples are shuffled
  • lowQRM = low-quality mappings are removed
  • 17 = SNPs with quality lower than 17 are removed
  3. Compare the results between the shuffled and unshuffled versions of 3 samples. The results (from the Excel file) are in the table below:
Sample     GATK  MPILEUP  INTERSECTION  SNP
ERR212128  1001     1341           982    7
ERR212129   938     1206           914   12
ERR212130   941     1221           925   12
dedup
ERR212128  1001     1341           982    7
ERR212129   938     1206           914   12
ERR212130   941     1221           925   12
shuffled
ERR212128  1001     1338           982    7
ERR212129   937     1212           914   12
ERR212130   939     1227           925   12
shuffled_dedup
ERR212128  1001     1338           982    7
ERR212129   937     1212           914   12
ERR212130   939     1227           925   12
lowQRM
ERR212128  1001     1375           989    7
ERR212129   938     1237           918   12
ERR212130   941     1243           929   12
dedup_lowQRM
ERR212128  1001     1375           989    7
ERR212129   938     1237           918   12
ERR212130   941     1243           929   12
shuffle_lowQRM
ERR212128  1001     1373           989    7
ERR212129   937     1236           917   12
ERR212130   939     1241           927   12
dedup_shuffle_lowQRM
ERR212128  1001     1373           989    7
ERR212129   937     1236           917   12
ERR212130   939     1241           927   12
normal_17
ERR212128  1001     1255           977    7
ERR212129   938     1107           906   12
ERR212130   941     1127           918   12
lowQRM_17
ERR212128  1001     1273           982    7
ERR212129   938     1123           912   12
ERR212130   941     1139           924   12
shuffled_lowQRM_17
ERR212128  1001     1274           982    7
ERR212129   937     1125           912   12
ERR212130   939     1137           923   12
shuffled_17_1thread
ERR212128  1001     1274           982    7
ERR212129   937     1125           912   12
ERR212130   939     1137           923   12
normal_17_4thread
ERR212128   999     1260           977    7
ERR212129   935     1107           906   12
ERR212130   942     1124           918   12
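  4. The results are not deterministic. After filtering low-quality reads and bases (GATK -nbq 30 filter 30, mpileup -Q 15), the 3 samples have identical GATK and INTERSECTION results for normal and shuffled input, both in the number of SNPs and in the SNPs themselves. The MPILEUP counts still differ.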
Sample     GATK  MPILEUP  INTERSECTION  SNP
normal
ERR212128  1002     1338           987    7
ERR212129   925     1204           911   12
ERR212130   927     1218           915   12
shuffled
ERR212128  1002     1337           987    7
ERR212129   925     1209           911   12
ERR212130   927     1223           915   12
  5. Try another 3 samples using the same parameters. Their normal and shuffled results are not identical, which suggests that GATK and mpileup may not be deterministic.
Sample     GATK  MPILEUP  INTERSECTION  SNP
shuffled
ERR212118   949     1222           938   12
ERR212127   954     1231           945   13
ERR212132   903     1210           896    9
ERR212128  1002     1369           992    7
ERR212129   925     1230           912   12
ERR212130   927     1236           917   12
normal
ERR212118   948     1222           937   12
ERR212127   953     1221           944   13
ERR212132   902     1206           895    9
ERR212128  1002     1369           992    7
ERR212129   925     1232           912   12
ERR212130   927     1238           917   12
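
Equal counts do not guarantee that the same SNPs were called, and unequal counts do not show which sites differ. A sketch for comparing two call sets site by site, assuming uncompressed VCF files (a .vcf.gz would first need gzip -dc). The function names vcf_sites, vcf_common and vcf_diff are hypothetical, and the pipeline's actual output file names are not known here.

```shell
# vcf_sites VCF: print CHROM, POS, REF and ALT of every record, sorted.
vcf_sites() {
    grep -v '^#' "$1" | cut -f1,2,4,5 | LC_ALL=C sort
}

# vcf_common A B: sites present in both files (the INTERSECTION of two call sets).
vcf_common() {
    _a=$(mktemp) || return 2
    _b=$(mktemp) || return 2
    vcf_sites "$1" > "$_a"
    vcf_sites "$2" > "$_b"
    LC_ALL=C comm -12 "$_a" "$_b"
    rm -f "$_a" "$_b"
}

# vcf_diff A B: sites found in only one file, marked '<' (only in A)
# or '>' (only in B). No output means the two call sets are identical.
vcf_diff() {
    _a=$(mktemp) || return 2
    _b=$(mktemp) || return 2
    vcf_sites "$1" > "$_a"
    vcf_sites "$2" > "$_b"
    LC_ALL=C comm -23 "$_a" "$_b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$_a" "$_b" | sed 's/^/> /'
    rm -f "$_a" "$_b"
}
```

Running vcf_diff on the normal and shuffled VCF of one sample and caller would list exactly the sites behind a count difference such as 948 vs 949.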

Experiment 20160604

  1. Since neither GATK nor mpileup appears to be deterministic (assuming BWA-MEM is deterministic), try freebayes.
Sample     GATK  MPILEUP  INTERSECTION  freebayes  SNP
GATK -nbq 30 filter 25, mpileup -Q 10
normal
ERR212118   948     1254           938       1404   12
ERR212127   953     1240           944       1417   13
ERR212132   902     1246           895       1386    9
ERR212128  1002     1397           993       1490    7
ERR212129   925     1267           912       1373   12
ERR212130   927     1285           918       1394   12
shuffled
ERR212118   949     1256           939       1399   12
ERR212127   954     1250           945       1423   13
ERR212132   903     1248           896       1387    9
ERR212128  1002     1396           993       1489    7
ERR212129   925     1266           912       1369   12
ERR212130   927     1280           918       1390   12
GATK -nbq 25 filter 27, mpileup -Q 15, freebayes -q 25 filter 25
normal
ERR212118   952     1174           940        988   12
ERR212127   959     1166           947       1006   13
ERR212132   919     1161           908        994    9
ERR212128  1003     1331           989       1103    7
ERR212129   931     1183           910       1001   12
ERR212130   933     1187           921       1002   12
shuffled
ERR212118   953     1173           941        989   12
ERR212127   959     1177           947       1010   13
ERR212132   919     1160           908        995    9
ERR212128  1003     1329           989       1104    7
ERR212129   931     1179           910       1001   12
ERR212130   932     1182           920       1003   12
GATK -nbq 25 filter 30, mpileup -Q 25, freebayes -q 25 filter 30
normal
ERR212118   952     1174           940        985   12
ERR212127   959     1166           947       1005   13
ERR212132   919     1161           908        993    9
ERR212128  1003     1331           989       1100    7
ERR212129   931     1183           910        998   12
ERR212130   933     1187           921        999   12
shuffled
ERR212118   953     1173           941        986   12
ERR212127   959     1177           947       1009   13
ERR212132   919     1160           908        991    9
ERR212128  1003     1329           989       1101    7
ERR212129   931     1179           910        994   12
ERR212130   932     1182           920        999   12
GATK -nbq 25 filter 200, mpileup -Q 25, freebayes -q 25 filter 200
normal
ERR212118   908     1174           903        909   12
ERR212127   923     1166           918        932   13
ERR212132   877     1161           875        922    9
ERR212128   963     1331           959       1016    7
ERR212129   888     1183           882        900   12
ERR212130   895     1187           889        925   12
shuffled
ERR212118   908     1173           903        909   12
ERR212127   922     1177           917        932   13
ERR212132   877     1160           875        921    9
ERR212128   962     1329           958       1016    7
ERR212129   888     1179           882        901   12
ERR212130   895     1182           889        927   12
  2. None of the SNP callers gives identical results for shuffled and unshuffled input. However, each of them gives the same results when the pipeline is run again with the same parameters. It therefore seems that BWA is not deterministic with respect to read order, and that this leads to the nondeterminism of GATK, mpileup and freebayes.
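
The suspicion about BWA could be checked directly by comparing the alignments of the normal and shuffled runs while ignoring record order. A minimal sketch, assuming the alignments are available as SAM text (for a BAM file, samtools view file.bam prints the records as SAM). The function name sam_fingerprint is made up for illustration, and only the first six SAM fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR) are compared.

```shell
# sam_fingerprint SAM: checksum over the alignment records that does not
# depend on the order of the records in the file. Header lines (starting
# with '@') are skipped; records are reduced to fields 1-6 and sorted
# before hashing.
sam_fingerprint() {
    grep -v '^@' "$1" | cut -f1-6 | LC_ALL=C sort | cksum
}
```

If sam_fingerprint prints the same value for the normal and the shuffled alignment of a sample, the mapper placed every read identically and the differences arise later in the pipeline. Different values point at the mapping step.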

Experiment 20160608

  1. mrFAST gives deterministic results on shuffled and unshuffled samples.
  2. However, it cannot work on fastq files whose reads have different lengths. Most samples in the Beijing dataset have reads of different lengths.
  3. Use the tool from Yanyi Lin to merge the 2 fastq files and filter them by length.

    /home/zhf615/TB_test/MALAWI/tools/fastq_merge/extract -m 14 -s test_1.fastq -p test_2.fastq -o final.fastq
    # Write to a separate file: redirecting the output back into final.fastq would truncate it before it is read.
    paste - - - - - - - - < final.fastq | awk 'length($3)>=100 && length($8)>=100' | sed 's/\t/\n/g' > final_filtered.fastq
    /home/zhf615/TB_test/MALAWI/tools/fastq_merge/extract -m 17 -p final_filtered.fastq -o final
    
  4. Because reads have different lengths, the parameter --crop 100 should be used in the mrFAST command.
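
Before choosing the length threshold and the crop value, it may help to look at the read lengths actually present in a sample. A small sketch, assuming plain 4-line fastq records; the function name read_length_histogram is hypothetical.

```shell
# read_length_histogram FASTQ: print "length<TAB>count" for every read
# length in the file, sorted by length. The sequence is the second line
# of each 4-line record.
read_length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print len "\t" n[len] }' "$1" | sort -n
}
```

A single output line means all reads already have the same length, so no filtering or cropping would be needed for that file.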

